Experiments in PCFG-like Disambiguation of Constituency Parse Forests for Polish

Authors

  • Marcin Woliński
  • Dominika Rogozińska
Abstract

The work presented here is the first attempt at creating a probabilistic constituency parser for Polish. The described algorithm disambiguates parse forests obtained from the Świgra parser in a manner close to Probabilistic Context-Free Grammars. The experiment was carried out and evaluated on the Składnica treebank. The idea behind the experiment was to check what can be achieved with this well-known method. The results are promising: the presented approach achieves up to 94.1% PARSEVAL F-measure and 92.1% ULAS. The PCFG-like algorithm can thus be compared with an existing Polish dependency parser, which achieves 92.2% ULAS.

1 Motivation and Context

The main incentive for the present work is the availability of the Składnica treebank of Polish (Woliński et al., 2011; Świdziński and Woliński, 2010)¹, which for the first time provides the means to attempt probabilistic parsing of Polish. Składnica is a constituency treebank based on parse forests generated by the Świgra parser and subsequently disambiguated by annotators. The parser generates parse forests representing all possible parse trees for a given sentence; the correct tree is then marked in the forest by the annotators.

Including a probabilistic module in the parsing process of Świgra would require tight integration and deep insight into its workings. Therefore, for the present experiments we have taken an approach that is technically simpler: we generate complete forests with unchanged Świgra, and the probabilistic algorithm then has to select one of the generated trees. This way the algorithm solves exactly the same problem as the annotators of the training corpus. In this paper we present a series of experiments based on Probabilistic Context-Free Grammars as a method for assigning probabilities to parse trees.

¹ http://zil.ipipan.waw.pl/Składnica

2 Scoring the Results

For evaluating disambiguated parses we use the PARSEVAL precision and recall measures (Abney et al., 1991), which count correctly recognised phrases in the algorithm output. A phrase, represented in the constituency tree by an internal node, is correct iff it has the right non-terminal and spans the correct fragment of the input text (it has the correct yield). Precision and recall are computed across the whole set of sentences being processed:

  Precision = (number of correct nodes) / (number of nodes selected by the algorithm)
  Recall = (number of correct nodes) / (number of nodes in the training trees)

In all experiments described below the values of precision and recall are close to each other (within 1 percentage point). This is not very surprising: the trees selected by the algorithms are close in the number of nodes to the training trees, so usually when a node is selected that should not be (spoiling precision), one of the nodes that should be selected is not (spoiling recall). For that reason we present the results in the aggregated form of the F-measure (the harmonic mean of precision and recall).

Non-terminals in Składnica are complex terms. The label of a non-terminal unit (e.g., the nominal phrase fno) is accompanied by several attributes (10 in the case of fno: morphological features such as case, gender, number, and person, as well as a few attributes specific to the grammar in use). We provide two variants of the F-measure: FL, which takes into account only whether the labels of the non-terminal units match, and FA, which requires a match on all attributes. The measures are counted against the internal nodes of the trees only, that is, the non-terminals.
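For illustration, the following is a minimal sketch of this scoring, directly following the formulas above. The node representation and function names are hypothetical (not the data structures used by Świgra or Składnica); the sketch only shows how FL and FA can be obtained from the same matching procedure.

```python
from collections import Counter

# Hypothetical node representation: each internal node is a tuple
# (label, attrs, span), where attrs is a tuple of (attribute, value)
# pairs and span is the (start, end) of the node's yield.  This is an
# illustration only, not the Świgra/Składnica data format.

def parseval_f(selected_trees, gold_trees, match_attrs=False):
    """PARSEVAL F-measure over the internal nodes of a set of sentences.

    match_attrs=False: a node is correct if its label and span match a
    reference node (FL).  match_attrs=True: all attributes must match
    as well (FA).
    """
    def key(node):
        label, attrs, span = node
        return (label, span, attrs) if match_attrs else (label, span)

    correct = n_selected = n_gold = 0
    for sel_nodes, gold_nodes in zip(selected_trees, gold_trees):
        gold_left = Counter(key(n) for n in gold_nodes)  # multiset of reference nodes
        n_selected += len(sel_nodes)
        n_gold += len(gold_nodes)
        for n in sel_nodes:                              # multiset matching
            k = key(n)
            if gold_left[k] > 0:
                correct += 1
                gold_left[k] -= 1

    precision = correct / n_selected
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)
```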
The terminals, carrying the morphological interpretations of words, are unambiguous in the manually annotated corpus.

Składnica contains information about the heads of phrases, which makes it easy to convert constituency trees to (unlabelled) dependency trees. We perform such a conversion to count the unlabelled attachment score (ULAS, the ratio of correctly assigned dependency edges) for the resulting trees. This allows us to compare our results with those of Wróblewska and Woliński (2012). We do not use Wróblewska's procedure for converting the trees to labelled dependency trees, since it contains some heuristic elements that could influence the results.

In all the reported experiments ten-fold cross-validation was used. Składnica contains trees for about 8000 sentences. This set was randomly divided into ten parts; in each of ten iterations nine parts were used for building the model and the remaining one to evaluate it.

3 Monkey Dendrologist – the Baseline

The task at hand mimics the work of the annotators (called dendrologists by the authors of Składnica), so as the baseline of our experiments we have selected a model that mimics a dendrologist who performs disambiguation by taking random decisions at each step.

In a shared parse forest typically only some nodes are ambiguous. These nodes have more than one decomposition into smaller phrases in the tree; this situation corresponds to the possibility of using more than one grammar rule to obtain the given node. Disambiguation can thus be seen as deciding, for each ambiguous node, which rule to take. In the tree in Fig. 1 ambiguous nodes are marked with rows of tiny rectangles with arrows (which allow various realisations to be selected in the search tool of Składnica). Each rectangle represents one realisation of the given node. In this tree 5 of the 35 internal nodes are ambiguous.

A "monkey dendrologist" considers the ambiguous nodes starting from the root of the tree and for each of them selects one of the possible realisations with equal probability. Note that these decisions are not independent: selecting a realisation for a node determines the set of ambiguous nodes that have to be considered among its descendants; ambiguous nodes that lie outside the selected subtrees will not even be considered. A variant of the monkey dendrologist is the "mean monkey dendrologist", which, when considering a node, first checks in the reference treebank which variant is correct and then selects randomly from the other variants. The following table presents the disambiguation quality of the monkey dendrologists:
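As an illustration of the random baseline described above, here is a minimal sketch assuming a hypothetical shared-forest representation in which every node lists its alternative realisations (sequences of child nodes); the class and function names are illustrative and are not taken from Świgra or Składnica.

```python
import random

class ForestNode:
    """Hypothetical shared-forest node: a label plus one or more
    realisations, each being a sequence of child ForestNodes.  A node
    is ambiguous iff it has more than one realisation; a terminal has
    a single empty realisation."""
    def __init__(self, label, realisations):
        self.label = label
        self.realisations = realisations

def monkey_disambiguate(node, rng=random):
    """Select one tree (nested (label, children) pairs) from the forest
    rooted at `node`, working top-down and choosing a realisation
    uniformly at random at every node.  Nodes outside the chosen
    subtrees are never visited, mirroring the dependence between the
    decisions described above."""
    children = rng.choice(node.realisations)
    return (node.label, [monkey_disambiguate(child, rng) for child in children])
```

The "mean monkey dendrologist" variant would differ only in that the correct realisation (looked up in the reference treebank) is removed from the candidates before the random choice is made.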
